Landsat Spectral Clustering

Right click to download this notebook from GitHub.


The image loaded here is a cropped portion of a LANDSAT image of Walker Lake.

In addition to dask-ml, we'll use rasterio to read the data and matplotlib to plot the figures. I'm just working on my laptop, so we could use either the threaded or distributed scheduler, but here I'll use the distributed scheduler for the diagnostics.

In [1]:
import rasterio
import numpy as np
import xarray as xr
import holoviews as hv
from holoviews import opts
from holoviews.operation.datashader import regrid
import cartopy.crs as ccrs
import dask.array as da
from dask_ml.cluster import SpectralClustering
from dask.distributed import Client
hv.extension('bokeh')
In [2]:
client = Client(processes=False)
client
Out[2]:

Client

Cluster

  • Workers: 1
  • Cores: 8
  • Memory: 17.18 GB
In [3]:
import intake
cat = intake.open_catalog('../catalog.yml')
list(cat)
Out[3]:
['landsat_5_small',
 'landsat_8_small',
 'landsat_5',
 'landsat_8',
 'google_landsat_band',
 'amazon_landsat_band',
 'fluxnet_daily',
 'fluxnet_metadata',
 'seattle_lidar']
In [4]:
landsat_5_img = cat.landsat_5.read_chunked()
landsat_5_img
Out[4]:
<xarray.DataArray (band: 6, y: 7241, x: 7961)>
dask.array<shape=(6, 7241, 7961), dtype=int16, chunksize=(1, 256, 256)>
Coordinates:
  * y        (y) float64 4.414e+06 4.414e+06 4.414e+06 ... 4.197e+06 4.197e+06
  * x        (x) float64 2.424e+05 2.424e+05 2.425e+05 ... 4.812e+05 4.812e+05
  * band     (band) int64 1 2 3 4 5 7
Attributes:
    transform:   (30.0, 0.0, 242385.0, 0.0, -30.0, 4414215.0)
    crs:         EPSG:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)
In [5]:
crs=ccrs.epsg(32611)
In [6]:
x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())
buffer = 1.7e4

xmin = x_center - buffer
xmax = x_center + buffer
ymin = y_center - buffer
ymax = y_center + buffer

ROI = landsat_5_img.sel(x=slice(xmin, xmax), y=slice(ymax, ymin))
ROI = ROI.where(ROI > ROI.nodatavals[0])
In [7]:
bands = ROI.astype(float)
bands = (bands - bands.mean()) / bands.std()
bands
Out[7]:
<xarray.DataArray (band: 6, y: 1134, x: 1133)>
dask.array<shape=(6, 1134, 1133), dtype=float64, chunksize=(1, 74, 3)>
Coordinates:
  * y        (y) float64 4.301e+06 4.301e+06 4.301e+06 ... 4.267e+06 4.267e+06
  * x        (x) float64 3.345e+05 3.345e+05 3.345e+05 ... 3.684e+05 3.684e+05
  * band     (band) int64 1 2 3 4 5 7
In [8]:
opts.defaults(
    opts.Image(invert_yaxis=True, width=250, height=250, tools=['hover'], cmap='viridis'))
In [9]:
hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[:3]])
Out[9]:
In [10]:
hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[3:]])
Out[10]:
In [11]:
flat_input = bands.stack(z=('y', 'x'))
flat_input
Out[11]:
<xarray.DataArray (band: 6, z: 1284822)>
dask.array<shape=(6, 1284822), dtype=float64, chunksize=(1, 13596)>
Coordinates:
  * band     (band) int64 1 2 3 4 5 7
  * z        (z) MultiIndex
  - y        (z) float64 4.301e+06 4.301e+06 4.301e+06 ... 4.301e+06 4.301e+06
  - x        (z) float64 3.345e+05 3.345e+05 3.345e+05 ... 3.353e+05 3.353e+05
In [12]:
flat_input.shape
Out[12]:
(6, 1284822)

We'll reshape the image to be how dask-ml / scikit-learn expect it: (n_samples, n_features) where n_features is 1 in this case. Then we'll persist that in memory. We still have a small dataset at this point. The large dataset, which dask helps us manage, is the intermediate n_samples x n_samples array that spectral clustering operates on. For our 2,500 x 2,500 pixel subset, that's ~50

In [13]:
X = flat_input.values.astype('float').T
X.shape
Out[13]:
(1284822, 6)
In [14]:
X = da.from_array(X, chunks=100_000)
X = client.persist(X)

And we'll fit the estimator.

In [15]:
clf = SpectralClustering(n_clusters=4, random_state=0,
                         gamma=None,
                         kmeans_params={'init_max_iter': 5},
                         persist_embedding=True)
In [16]:
%time clf.fit(X)
INFO:dask_ml.cluster.spectral:Starting check array
INFO:dask_ml.cluster.spectral:Finished check array
INFO:dask_ml.cluster.spectral:A: 80.00 kB, (1, 1) blocks
INFO:dask_ml.cluster.spectral:B: 1.03 GB, (1, 13) blocks
INFO:dask_ml.cluster.spectral:A2: 80.00 kB, (1, 1) blocks
INFO:dask_ml.cluster.spectral:B2: 1.03 GB, (1, 13) blocks
INFO:dask_ml.cluster.spectral:V2.1: 41.11 MB, (14, 1) blocks
INFO:dask_ml.cluster.spectral:V2.2: 41.11 MB, (13, 1) blocks
INFO:dask_ml.cluster.spectral:U2.2: 41.11 MB, (13, 1) blocks
INFO:dask_ml.cluster.spectral:U2.3: 41.11 MB, (213, 1) blocks
INFO:dask_ml.cluster.spectral:Persisting array for k-means
INFO:dask_ml.cluster.spectral:k-means for assign_labels[starting]
INFO:root:Starting _check_array
INFO:root:Finished _check_array in 0:00:07.593142
INFO:root:Starting init_scalable
INFO:dask_ml.cluster.k_means:Initializing with k-means||
INFO:dask_ml.cluster.k_means:Starting init iteration  1/ 5 ,  1 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  1/ 5 ,  1 centers in 0:00:01.259427
INFO:dask_ml.cluster.k_means:Starting init iteration  2/ 5 ,  5 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  2/ 5 ,  5 centers in 0:00:01.361653
INFO:dask_ml.cluster.k_means:Starting init iteration  3/ 5 ,  7 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  3/ 5 ,  7 centers in 0:00:01.340426
INFO:dask_ml.cluster.k_means:Starting init iteration  4/ 5 , 10 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  4/ 5 , 10 centers in 0:00:01.186982
INFO:dask_ml.cluster.k_means:Starting init iteration  5/ 5 , 12 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  5/ 5 , 12 centers in 0:00:01.411463
INFO:root:Finished init_scalable in 0:00:07.246632
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  0.
INFO:dask_ml.cluster.k_means:Shift: 0.1131
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  0. in 0:00:02.642743
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  1.
INFO:dask_ml.cluster.k_means:Shift: 0.0040
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  1. in 0:00:02.252799
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  2.
INFO:dask_ml.cluster.k_means:Shift: 0.0003
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  2. in 0:00:02.522902
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  3.
INFO:dask_ml.cluster.k_means:Shift: 0.0000
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  3. in 0:00:02.327035
INFO:dask_ml.cluster.spectral:k-means for assign_labels[finished]
CPU times: user 44.5 s, sys: 14.7 s, total: 59.2 s
Wall time: 25.8 s
Out[16]:
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
          eigen_solver=None, eigen_tol=0.0, gamma=None, kernel_params=None,
          kmeans_params={'init_max_iter': 5}, n_clusters=4,
          n_components=100, n_init=10, n_jobs=1, n_neighbors=10,
          persist_embedding=True, random_state=0)
In [17]:
labels = clf.assign_labels_.labels_.compute()
labels.shape
Out[17]:
(1284822,)
In [18]:
labels = labels.reshape(bands[0].shape)
In [19]:
hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands]) 
Out[19]:
In [20]:
hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[3:]])
Out[20]:
In [21]:
hv.Image(labels)
Out[21]: